A nice aspect of formal computer science education is that it requires writing useless programs in a low-level programming language like C which pass automated tests with harsh time and memory limits. The following list shows several methods which I used to optimize such programs and examines their usefulness for ‘real’ programs.
1. Don’t optimize uncompleted programs
Eric S. Raymond explains this by the fact that incomplete programs are not understood completely, so a thing considered not an optimization may make the program slower when it will be completed. In case of small programs for batch processing of data there is another reason – slow but easy to understand program may generate output which will be compared to output of optimized versions.
For real programs there is another advantage – a simple prototype might be used to describe the design and allow much faster development than a fast program written in C. Once I wrote a trivial Python script to explain a limitation of standard Unix program uniq. It was so simple that in next few days I decided to implement a complete uniq-like program in C.
2. Don’t optimize fast enough programs
In case of homeworks the aim is usually to make a program passing some automated tests. If it is possible to test the program many times, it would be not useful to optimize it after all tests are passed in required time.
In real life programs are used with more than several different input files, so this argument does not apply to non-homework programs. But there the Moore’s Law makes most optimization pointless.
3. Measure before any change
It is very difficult for humans to predict performance effect of a change in a program. For example, C programs with a loop consisting of billions of pre- or postincrements of a variable saved to another variable produce on my machine assembly code differing only by the order of instructions in the loop. It looks obvious that these programs would work for the same amount of time, but the one with preincrements was faster by about 25% when using an AMD Phenom processor. In this case the correct solution was to enable compiler optimizations which removed the whole loop, but in many situations it is better to make small changes and compare their effect on the program performance.
In real Unix programs profilers are used for this, but for programs containing only one function it is appropriate to just measure the time spent running the whole program on a large input.
If a typical input takes one microsecond to process, then it might be necessary to write a script making much larger inputs. Since they would be too large to have output correctness verified by human, an unoptimized ‘correct’ implementation of the program being improved is useful to generate output for regression testing.
4. Don’t use dynamic memory allocation
As explained by Joel Spolsky, malloc and similar functions use a slow algorithm to decide which memory is free. This probably could have been improved since 2001, but still is slow. Fortunately many homework problems specify limits on input size which may allow using statically allocated arrays for everything.
In real programs malloc is often avoided by ignoring characters after the 2047 column of a line or by crashing the program when they occur. An advantage of many GNU programs is that they do not limit input length and dynamically allocate input buffers. Therefore this tip is mostly useless for non-homework problems.
5. Don’t copy memory
CPUs are thousands of times faster than memory. Therefore operations on large amounts of memory become the performance bottleneck of programs without significant I/O.
Performance of a program can be improved by using a single buffer for e.g. an input line, instead of copying it many times. In some cases code reading the input might do some calculations on it which will lead to storing much less data in memory. This might be a useful improvement in case of programs performing O(n) algorithms on these data.
6. Write small and simple programs
Since memory is slow, modern CPUs store some of it in faster and more expensive caches. Therefore smaller code might be stored completely in cache and avoid slow memory access, as explained by Raymond.
This is done mainly by removing optimizations designed for older processors. A nice example of this is the removal of many uses of Duff’s device from the XFree86 X11 server. As stated by Theodore Ts’o, ‘by eliminating all instances of Duff's Device from the XFree86 4.0 server, the server shrunk in size by _half_ _a_ _megabyte_ (!!!), and was faster to boot’.

Leave a comment