2009-01-16

Profiling, optimizing, and carpentry

Ok, actually this post really isn't about carpentry, except that the old adage "measure twice, cut once" is just as important when optimizing code. Unlike carpentry, though, there are many variables in play when measuring code performance, and that's what this post is about.

I'll start by describing the particular incident that got me thinking about this. A few days ago I was working on a little hobby program that has a section of code that is executed frequently and makes up a significant portion of the program's total running time. Basically I have an object with a set of byte data, and I was considering two ways to store that data: as a simple multi-dimensional byte array, or as a much smaller single-dimension long array (I'm running on a 64-bit CPU) that I would index into to extract single byte values.

I suspected that the long[] method would be slower since it is less direct than a straight byte[], but there's another piece to the puzzle. As well as accessing these byte values, I at other times need to compare one set of values to another. My thinking was that if the long[] method was a little slower than the byte[] method for access, that loss might be offset by a much faster comparison.

Anyway, on with the story... So I created a test case to try both methods a few million times and report performance metrics. Since my concern with the long[] method was that the calculations to index into the long values and extract a byte might be a performance hit, I decided to have my test case use random indexes to exercise various portions of the data.

I ran the tests and found the byte[] method to be marginally faster than the long[] method; the running-time ratio was about 7:8. I must admit I was pleasantly surprised and suspected that the compiler might have made some pretty good optimizations to achieve this, but I was happy to accept the result.

I then proceeded to implement the long[] method in the rest of my code to test its performance in a more real-world scenario. I decided to use a few #ifdef blocks so that I could switch between the original byte[] method and the long[] method and compare the two. This implementation took about an hour, and when I finally ran some real benchmarks I found the long[] method to be about 1/5th as fast as the byte[] method.
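The switching itself is straightforward; something along these lines, where USE_PACKED_LONGS is a macro name I've made up for illustration rather than the one actually used:

    /* At each access site, a compile-time switch selects the implementation;
     * build with -DUSE_PACKED_LONGS to test the packed variant. */
    #ifdef USE_PACKED_LONGS
        value = get_byte(packed_data, index);   /* shift-and-mask extraction */
    #else
        value = byte_data[index];               /* direct byte access */
    #endif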

1/5th?! The test cases showed a performance hit of 8:7, not 5:1. What's up with that? So I went back to the performance test and considered the possibilities. To keep a short story short, the problem was the random index generation. The time spent in the random number generator so far outweighed the data access time that my initial measurements were drastically skewed.

Why would that be? Well, it works something like this. If the actual performance of just my data access code is 5,000 ops/sec for the byte[] method and 1,000 ops/sec for the long[] method, that gives the 5:1 ratio the real-world tests showed. If, however, the random index generation adds enough overhead that I get only 80 ops/sec for byte[] and 70 ops/sec for long[], then that gives the 7:8 ratio. In fact, after I removed the random index generation the performance test gave me the expected 5:1 results.
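Plugging those round, purely illustrative numbers into a quick back-of-the-envelope check shows just how badly the overhead swamps the thing being measured:

    #include <stdio.h>

    /* Using the illustrative figures above: what fraction of each measured
     * iteration is data access, and what fraction is everything else
     * (mostly the random number generator)? */
    int main(void)
    {
        double access_byte = 1.0 / 5000.0;  /* seconds per op, byte[] alone */
        double access_long = 1.0 / 1000.0;  /* seconds per op, long[] alone */
        double total_byte  = 1.0 / 80.0;    /* seconds per op as measured   */
        double total_long  = 1.0 / 70.0;    /* seconds per op as measured   */

        printf("byte[]: %.0f%% of each iteration is overhead\n",
               100.0 * (total_byte - access_byte) / total_byte);   /* ~98% */
        printf("long[]: %.0f%% of each iteration is overhead\n",
               100.0 * (total_long - access_long) / total_long);   /* ~93% */
        return 0;
    }

With roughly 93-98% of every iteration spent outside the code under test, the 5:1 difference all but disappears from the measurement.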

So, the moral of the story is: make sure you are measuring what you think you're measuring. What I probably should have done was run my test case through a profiler. That would have revealed that a significant portion of the test's run time was spent in the random number generation, and it would also have shown numbers for the data access methods much closer to the actual 5:1 performance ratio.
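Assuming a C program built with the GNU toolchain (the file names here are made up for illustration), getting that profile can be as simple as:

    gcc -pg -O2 test_case.c -o test_case    # compile with profiling hooks
    ./test_case                             # run normally; writes gmon.out
    gprof ./test_case gmon.out              # report where the time actually went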