Software Optimization by Code Rewriting

There are multiple ways to optimize software that is too slow or too big (uses too much memory): make individual pieces of code more efficient, speed up computation of common cases, or store results for reuse. Here I describe ways to speed up small sections of code. See Software Optimization by Pattern Matching for ways to specialize code for common cases and Software Optimization by Data Caching for ways to store results for reuse.

A first implementation of an algorithm is often simplified in some ways to reduce development time. For most programs, where the performance "hotspots" are not known ahead of time, this is the best strategy. "Make it right before you make it fast (or small)." Highly optimized code tends to be more complex, and increased complexity leads to a higher bug rate for two reasons: there are more lines of code, and the number of errors per line of code tends to be higher for code that is not straightforward.

Once you have working software, profiling will tell you where the program spends most of its time (or uses the most memory). This original implementation provides a reference against which you can compare the improved software for correctness and efficiency. If there are differences, you can single-step through the implementations in parallel to identify the point at which the differences begin. If a trial solution is slower or bigger, you can use this information to refine your understanding of the software.

By far, the most common reason why a program runs slower than it could is an inefficient operation of some kind. Often this is a sort or a search of some kind. Volume 3 of Donald Knuth's famed "The Art of Computer Programming" book series is named "Sorting and Searching" for a very good reason - organizing and finding data lie at the heart of many software tasks.

Loop Rewrites

It takes time to traverse any large set of data. Even if the data fits into memory, a modern multi-issue CPU can execute dozens of instructions while waiting for a word of data to be loaded from off-chip memory (this is why on-chip cache memory is so important). Thus it is important to minimize the amount of memory accessed during a sort or search.

There are many ways to store and retrieve data, of course. One of the simplest is an array. Searching an array is very simple: examine each entry in turn, stopping when the desired element is found. Costs for array accesses can add up quickly, however:

On average, half of the array must be examined to find a specific object;
Nested search loops increase run time by a multiplicative factor for each nesting level; and
Examining widely scattered fields in each record increases cache memory usage.

If you are processing each element in an array, then you can get each one in sequence at a low fixed cost. Often, you want to process one or a few elements in the array based on some criteria. If the array is large, the search cost may grow to an unacceptable level. Often you won't know that the array can get large until you begin testing the software in the real world.

Nested search loops very quickly increase run times. Sometimes these are not obvious when the code is written, because good structured programming techniques hide implementation details. If a higher level loop invokes a function that also has a loop, total run time will be the product of the two loop counts. A third loop level will result in another multiplication.

Take a look at each level of the nested loop. If the parameters used in a lower level loop do not change much, and if the resulting set of results is relatively small, pull the lower level loop out and store the results temporarily. If the original code looks like this:

for (i = 0; i < limit1; ++i) { val1 = array1[i]; if (condition1) { for (j = 0; j < limit2; ++j) { val2 = array2[j]; if (condition2) { // do some processing on val1 and val2 } } } }

Then you can try this:

for (i = num1 = 0; i < limit1; ++i) { val1 = array1[i]; if (condition1) array1a[num1++] = val1; } for (j = num2 = 0; j < limit2; ++j) { val2 = array2[j]; if (condition2) array2a[num2++] = val2; } for (i = 0; i < num1; ++i) { val1 = array1a[i]; for (j = 0; j < num2; ++j) { val2 = array2a[j]; // do some processing on val1 and val2 } }

This works well if condition2 is a constant or nearly so (if it is not constant, repeat it within the nested loop so that each val1 is processed only with an appropriate val2). The idea is that num1 is much less than limit1 and num2 is much less than limit2, so that their product is small compared to the product of limit1 and limit2.

If the levels of code are in separate functions, consider merging all of them into a single function. This is of course less readable, so it should be done with caution. Do this only after the code is working so that you have a reference implementation.

The key is that optimization opportunities such as the one described above present themselves when all of the levels are together. You may find that values are recomputed, or that searches are repeated. Calculations may be independent of each other, allowing you to perform multiple operations per iteration.

Nested loops in separate functions may be unavoidable, but if the lower level search routines are small, function call overhead may be significant. Try implementing the lower levels inline (either by copying the code or, in C++, exposing the functions in the object header and marking them as inline).

Indexing

Indexing or sorting data based on a frequently used key value is a common way of reducing run time when individual items must be retrieved. Keys that can be reduced to a single numeric value allow you to use fast tree-oriented or hash-based search mechanisms. These reduce linear costs to logarithmic or even fixed-time costs. Of course, they are most practical when that key value is the only one used for searching or sorting. Otherwise multiple trees or hash tables must be set up, increasing memory usage (and code complexity) considerably.

If one key value is used more than any other, you can apply several strategies:

Use an array, but sort it based on that key value and use a binary search whenever that key is the determining factor. This is most effective when changes to that key (including creation or deletion of objects) are rare, so that sorting is done rarely or (better yet) only once at the start.
Set up a parallel tree or hash table for accesses using that key value, but keep the array. This is often a good compromise between memory usage and run time.

In both cases, you would use a linear search for the (hopefully infrequent) searches based on other keys within the records.

Data Structure Modifications

Modifications to data structures can yield incremental improvements, even if they are not as dramatic as reducing polynomial runtimes to logarithmic or linear runtimes. These techniques are complementary to other runtime improvements, so you can use them at almost any time.

Defining Your Own Data Structures

One of the simplest ways to improve the efficiency of data structures is to define your own. The templates in the Standard Template Library (STL) or Boost library are convenient, but they are written for very general use and may have hidden costs in your program. They make very heavy use of generic programming and composition, which can interact badly.

For example, consider the following C++ code fragment:

std::vector<std::string> ary; std::sort(ary.begin(),array.end());

This seems straightforward, but if the array is large, the sort will be very slow. Why? Sorting requires exchanging elements, typically O(n * log₂ n) or O(n * sqrt(n)) times. An exchange requires three lines of code:

tmp = a; a = b; b = tmp;

When exchanging two std::string objects, three copy operations will be performed. Each will require memory allocation, object construction, data copying, object destruction, and memory deallocation. When there are a million strings in the array, this can be expensive!

On the other hand, the C equivalent is much faster:

char **ary; int num_elems;

qsort(ary,num_elems,strcmp);

The exchange still requires the three lines of code above, but each line simply moves a pointer, so it can be a single instruction. The sort requires the same number of operations, but they are so much cheaper that you will be able to move on to the next hotspot.

In the program from which I took this code fragment, I was able to switch the program's data representation from C++ strings to the C (char *) equivalent easily. You may need to copy the strings into (char *) pointers temporarily for the sort, then convert back.

Composition of C++ classes as shown above has other costs. As you add strings to the vector, it will have to be resized. Once again, each string in the old array will have to be copied to the new array, with a cost of allocation, construction, copying, destruction, and deallocation. Growing an array of (char *) pointers requires only that the pointers be copied.

The cost of using the Standard Template Library is so high that I generally will not use it in high-performance code. A little convenience is not worth the loss of control over runtime.

Improving Data Structure Cache Usage

Reordering fields within data structures can also help. Quite often a sort or search will use only a few fields within a data structure. Every time you access a field, the processor must load the memory for that field into cache.

Cache lines are long - 32 to 64 bytes is typical - so you may get a lot more data than you really need. If you need to look at three widely separated fields, you may get 192 bytes loaded into the cache only to use 12 bytes. Placing the fields close together within the record results in a single cache line load, and 12 of the 64 bytes would be used.

Tools like the free software tool cachegrind (part of the valgrind suite) can help you determine when your data structures could be repacked profitably. Learn how to use them, so you have a way to tease out subtle patterns that are not apparent with ^C Optimization.

Refactoring

Refactoring is code restructuring that does not change the function of the code but which improves it in some way. Here we want to improve the speed or memory usage of a program.

To improve speed, you can group related fields together to improve cache usage, using a cache simulator as described above. You can also precompute and store frequently used values. Both methods require experimentation to determine which fields or values work best.

To improve memory usage, you can group seldom-used (but related) fields together, then create a data structure to hold them. This data structure would be allocated only when necessary. When the use fraction is low enough, the per-record savings will exceed the memory allocation overhead (which can be as low as one pointer when a zero-overhead memory allocator is used). Again, experimentation is necessary; you may find that there will be several groups of fields and thus several records.

This method works very well in C++, where you can protect access to member fields. The old interface functions can be retained, providing guarded access to the new data structure. The code that calls them can be modified on an as-needed basis to use the new data structure and its fields directly.

Line-by-Line Optimization

Modern optimizing compilers can make many improvements within individual lines of code, such as expression strength reduction and hoisting (moving constant calculations outside of a loop), but they can't find everything. You may be able to find patterns that the compiler writers have not seen yet, or make analyses that a compiler cannot.

For example, if you know that a function always returns the same value given the same inputs, you can move the function call outside of the loop. A compiler generally cannot do this unless the function is an inline C++ member function (even const C++ member functions can modify state somewhere).

Sometimes you can replace a call to a library function with your own, more-efficient version. For example, if you know the length of a string in a buffer and the length of a string to append to that buffer, you can use strncpy() and then update the buffer length rather than use strcat(), which must search for the end of the destination buffer and must also watch for the end of the string being appended. strncat() can run blindly - and thus quickly - for the specified number of bytes.

Finally, you can precompute and temporarily store values being used in a CPU-intensive algorithm. I once wrote a general-purpose radix sort that would read a numeric key from any arbitrary record (the offset was passed to the sort routine). For 32-bit (four-byte) integers, this numeric key would be read four times per record. Storing the integer locally (instead of rereading it each pass) reduced runtime of the sort routine by 6%!

Conclusions

Not all performance enhancements require sophisticated algorithms like pattern matching or caching. You can get significant improvements from relatively straightforward changes to individual functions. Because these kinds of changes are inherently local, you should try them first before trying the "major surgery."

It may seem that some of these enhancements are too small to be worthwhile, but they do add up. Automobile gas mileage ratings have been creeping upward over the years as manufacturers made improvements 1% at a time. Considering the power consumption of computers in server farms today, the computing industry would be wise to pursue a similar strategy - faster programs use fewer servers and less energy. And like it or not, code lasts a long time. You're not going to write a newer, faster implementation each year.

Chapman Consulting

Software Development Done Right.